All of the code that we will work through this semester will be
stored as RMarkdown files, which have a .Rmd extension.
These files are great because they allow us to mix code and descriptions
within the same file. Reading these notes will give you a brief overview
of how this works; we will practice hands-on in class.
When opening an RMarkdown file in RStudio, you should see a window similar to this (it will be slightly different on Windows and depending on your screen size):
On the left is the actual file itself. Some output and other helpful bits of information are shown on the right. There is also a Console window, which we generally will not need. I have minimized it in the graphic.
Notice that the file has parts that are on a white background and other parts that are on a grey background. The white parts correspond to text and the grey parts to code. In order to run the code, and to see the output, click on the green play button on the right of each block.
When you run code to read or create a new data set, the data will be listed in the Environment tab in the upper right hand side of RStudio:
Clicking on the data will open a spreadsheet version of the data that you can view to understand the structure of your data and to see all of the columns that are available for analysis:
Going back to the RMarkdown file by clicking on the tab on the upper row, we can see how graphics work in R. We have written some code to produce a scatter plot. When the code is run, the plot displays inside of the markdown file:
Make sure to save the notebook frequently. However, notice that only the text and code itself is saved. The results (plots, tables, and other output) are not automatically stored. This is actually helpful because the code is much smaller than the results and it helps to keep the file sizes small. If you would like to save the results in a way that can be shared with others, you need to knit the file by clicking on the Knit button (it has a ball of yarn icon) at the top of the notebook. After running all the code from scratch, it will produce an HTML version of our script that you can open in a web browser:
In fact, the notes that you are currently reading were created with RMarkdown files that are knitted to HTML.
Markdown is a lightweight markup language used to format text for the web. RMarkdown documents combine markdown prose with R code. Here are some of the key things you will need to know how to do with markdown:
- Headers are marked with the # symbol. The number of # symbols determines the size of the header.

    # Big header
    The quick brown fox jumped over the lazy dog.
    ## Smaller header
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.

- Use * or - for bullet points and numbers for ordered lists.

You will see all of these elements (and more) in the .Rmd notebooks that you will complete for homework.
For more information, see the R Markdown guide.
We now want to give a very brief overview of how to run R code. From here on, we show only snippets of R code and their output rather than a screenshot of the entire RStudio session. You should think of each snippet as occurring inside one of the grey boxes in an RMarkdown file.
In one of its most basic forms, R can be used as a fancy calculator. For example, we can divide 12 by 4:
## [1] 3
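The division described above can be written with R's `/` operator. A minimal sketch:

```r
# Divide 12 by 4; R prints the result with a leading [1] index
12 / 4
```

Running this inside a grey code block should print the value 3, as in the output above.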
We can also store values by creating new objects within R.
To do this, use the <- (arrow) symbol, which is called
the “assignment operator.” For example, we can create a new object
called mynum with a value of 8 by:
We can now use our new object mynum exactly the same way
that we would use the number 8. For example, adding it to 1 gives
the number nine:
## [1] 9
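The two steps described above, creating the object and then using it, can be sketched as follows:

```r
# Create an object called mynum holding the value 8
mynum <- 8

# Use mynum exactly as we would use the number 8
mynum + 1
```

Note that the assignment itself prints nothing; only the second line produces output.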
Two things to note about object names:
- Object names are case sensitive: Mynum != mynum.
- <- assignment. For instance:

## [1] -10
Object names must start with a letter, but can also use underscores and periods. This semester, we will use only lowercase letters and underscores for object names. That makes it easier to read and easier to remember what you have called things.
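As a small illustration of these naming points, the sketch below checks that R treats `mynum` and `Mynum` as different names. The name `num_apples` is hypothetical and only shows the lowercase-and-underscore style; the check assumes no object called `Mynum` has been created in your session.

```r
# Names are case sensitive: creating mynum does not create Mynum
mynum <- 8
exists("mynum")   # is there an object with this exact name?
exists("Mynum")   # a different name, so a different answer

# A name in the style used this semester: lowercase letters and underscores
num_apples <- 3   # hypothetical name, for illustration only
num_apples
```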
A function in R is something that takes a number of input values and returns an output value. Generally, a function will look something like this:
Where arg1 and arg2 are the names of the
inputs (“arguments”) to the function (they are fixed) and
input1 and input2 are the values that we will
assign to them. The number of arguments is not always two, however.
There may be any number of arguments, including zero. Also, there may be
additional optional arguments that have default values that can be
modified.
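The general pattern above can be sketched with two base R functions. These are chosen here purely for illustration and are not necessarily the examples used in class: `round` takes an argument `x` and an optional argument `digits`, and `Sys.Date` takes no arguments at all.

```r
# General shape (not runnable as written):
#   function_name(arg1 = input1, arg2 = input2)

# Two named arguments: round the value x to the given number of digits
round(x = 3.14159, digits = 2)

# The optional argument digits has a default value of 0, so it can be left out
round(x = 3.14159)

# A function with zero arguments: Sys.Date() returns today's date
Sys.Date()
```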
Let us look at an example function: seq. This function
returns a sequence of numbers. We can give the function two input
arguments: the starting point from and the ending point
to.
## [1] 1 2 3 4 5 6 7
The function returns a sequence of numbers starting from 1 and ending at 7 in increments of 1. The return values are shown (in this document) right below the code block. Note that you can also pass arguments by position, in which case we use the default ordering of the arguments. Here is the same code but without the names:
## [1] 1 2 3 4 5 6 7
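The two calls described above, one with named arguments and one with arguments passed by position, can be sketched as:

```r
# Named arguments: from is the starting point, to is the ending point
seq(from = 1, to = 7)

# The same call with the arguments passed by position
seq(1, 7)
```

Naming the arguments takes a little more typing, but it makes the code easier to read and avoids mistakes when a function has many arguments.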
There is also an optional argument by that controls the
spacing between each of the numbers. By default it is equal to 1, but we
can change it to space the points half a unit apart.
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
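A minimal sketch of a call that uses the optional argument, setting `by` to 0.5:

```r
# Start at 1, end at 7, and step by 0.5 instead of the default 1
seq(from = 1, to = 7, by = 0.5)
```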
We will learn how to use numerous functions in the coming notes.
When you forget how to use a function (we all do!), you can look up the documentation like so:
This also works:
This will pull up the documentation for the seq function
in the Help tab in RStudio. You can also search for functions in the
Help tab by typing in the search bar.
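Two standard ways to look up documentation in R are the `?` operator and the `help` function. The sketch below assumes these are the two commands meant above:

```r
# Open the documentation for seq in the Help tab
?seq

# An equivalent call using the help function
help(seq)
```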
In these notes we will be working with data that is stored in a tabular format. Let’s start with the vocabulary of tabular data:
- A variable (or feature) is a quantity, quality, or property that you can measure.
- A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
- An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.
- Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
from R for Data Science (2023)
Here is an example of a tabular data set of food types, which has nine rows and five columns. Each row tells us the nutritional properties contained in 100 grams of a particular type of food.
Every row of the data set represents a particular object in our data set, each of which we call an observation. In our food type example, each individual food corresponds to a specific observation:
The columns in a tabular data set represent the measurements that we record for each observation. These measurements are called variables or features. In our example data set, we have five features which record the name of the food type, the food group that the food falls into, the number of calories in a 100g serving, the amount of sodium (mg) in a 100g serving, and the amount of vitamin A (as a percentage of daily recommended value) in a 100g serving.
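To connect this vocabulary to R, here is a small sketch that builds a three-row table by hand. The object name `small_food` is hypothetical, and the values are copied from the first rows of the foods output shown later in these notes. It uses the base R function `data.frame` rather than reading a file.

```r
# Three observations (rows) and three variables (columns)
small_food <- data.frame(
  item = c("Apple", "Asparagus", "Avocado"),
  food_group = c("fruit", "vegetable", "fruit"),
  calories = c(52, 20, 160)
)

nrow(small_food)         # number of observations
ncol(small_food)         # number of variables
names(small_food)        # the variable names
small_food$calories[1]   # one value: the calories of the first observation
```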
A larger version of this data set, with more food types and
nutritional facts, is included in the course materials. We will make
extensive use of this data set in the following notes as a common
example for creating visualizations, performing data manipulation, and
building models. In order to read in the data set we use a function
called read_csv and pass it a description of where the file
is located relative to where this script is stored. The data is called
foods.csv and is stored in the folder data.
The following code loads the foods data set into R, saves it as an
object called food, and prints out the first several
rows:
## # A tibble: 61 × 17
## item food_group calories total_fat sat_fat cholesterol sodium carbs
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Apple fruit 52 0.1 0.028 0 1 13.8
## 2 Asparagus vegetable 20 0.1 0.046 0 2 3.88
## 3 Avocado fruit 160 14.6 2.13 0 7 8.53
## 4 Banana fruit 89 0.3 0.112 0 1 22.8
## 5 Chickpea grains 180 2.9 0.309 0 243 30.0
## 6 String Be… vegetable 31 0.1 0.026 0 6 7.13
## 7 Beef meat 288 19.5 7.73 87 384 0
## 8 Bell Pepp… vegetable 26 0 0.059 0 2 6.03
## 9 Crab fish 87 1 0.222 78 293 0.04
## 10 Broccoli vegetable 34 0.3 0.039 0 33 6.64
## # ℹ 51 more rows
## # ℹ 9 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>, iron <dbl>,
## # vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## # color <chr>
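Output like the above could be produced by code along the following lines. This is a sketch: it assumes that `read_csv` comes from the readr package and that your working directory contains the data folder. The name `foods_path` and the existence check are additions for illustration.

```r
library(readr)  # assumption: read_csv is provided by the readr package

# Path relative to where this script is stored: data/foods.csv
foods_path <- file.path("data", "foods.csv")

# Read the file, store it as food, then print the first rows
if (file.exists(foods_path)) {
  food <- read_csv(foods_path)
  food
} else {
  message("Could not find ", foods_path, "; check your working directory.")
}
```

If the file cannot be found, the most common cause is that the notebook is being run from a different folder than the one that contains data.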
Notice that the display shows that there are a total of 61 rows and
17 features. The first 10 rows and 8 columns are shown. At the bottom,
the names of the additional features are given. As described above,
if you run this in RStudio, you can view a full tabular version of the data
set by clicking on the data set name in the Environment tab. The
abbreviations <chr> and <dbl> tell
us which features are characters (item,
food_group, wiki, description, and
color) and which are numbers (all the others).
If you prefer to type, the following command does the same thing:
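A likely candidate is the `View` function, which opens the same spreadsheet-style viewer. The sketch assumes that this is the command meant here and that `food` has already been loaded as described above:

```r
# Assumption: food was created earlier with read_csv
# View opens the spreadsheet-style viewer, so it is only called in an
# interactive session where the food object exists
if (interactive() && exists("food")) {
  View(food)
}
```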
Many of the examples in the following notes will make use of this foods data set to demonstrate new concepts. Another related data set that will also be useful for illustrating several concepts contains the prices of various food items for over 140 years. We can read it into R using a similar block of code, namely:
## # A tibble: 146 × 14
## year tea sugar peanuts coffee cocoa wheat rye rice corn barley
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1870 129. 151. 203. 88.1 78.8 88.1 103. 83.5 121. 103.
## 2 1871 132. 167. 222. 109. 66.7 118. 105. 84.5 88.4 130.
## 3 1872 134. 162. 189. 140. 71.6 122. 102. 92.9 69.2 125.
## 4 1873 136. 154. 179. 173. 65.8 116. 106. 91.0 67.1 166.
## 5 1874 146. 153. 231. 187. 69.9 113. 126. 99.6 128. 174.
## 6 1875 149. 150. 197. 176. 69.4 110. 116. 85.8 127. 161.
## 7 1876 150. 160. 172. 184. 80.7 114. 106. 95.3 91.2 132.
## 8 1877 149. 189. 153. 198. 87.8 144. 97.0 108. 94.5 125.
## 9 1878 150. 165. 160. 169. 96.0 115. 91.6 114. 82.2 121.
## 10 1879 144. 158. 133. 149. 108. 118. 113. 110. 78.7 124.
## # ℹ 136 more rows
## # ℹ 3 more variables: pork <dbl>, beef <dbl>, lamb <dbl>
Here, each observation is a year. Features correspond to specific
types of food. Notice that this is different than the foods
data set, in which the food items were observations.
It is very important to format your code in a consistent way. Even when code runs without errors and produces the desired results, good formatting makes it easier to read and debug. We will follow these guidelines:
+ and *)
It will make your life a lot easier if you get used to these rules right from the start. We will practice and review this in class.
For those of you who prefer to use keyboard shortcuts, there are two essential shortcuts, and many optional ones.
The first essential shortcut opens the Command
Palette: Cmd + Shift + P on Mac or
Ctrl + Shift + P on Windows. The command
palette allows you to search for any command in RStudio.
If you memorize this shortcut, you don’t have to memorize the other
ones.
The second is the shortcut to run the current line of code. This is
Cmd + Enter on Mac or Ctrl + Enter on Windows.
This will run the current line of code in the console. If you have
multiple lines of code selected, it will run all of them.
Here is the complete keyboard shortcut list in the RStudio documentation.
At the end of each set of notes, such as this one, there will be a short set of questions or activities to complete before the next class. Bring written solutions with you to class.
Make sure you have R, RStudio, and all of the packages installed. If you are still having trouble with anything, please let me know during class.
On a piece of paper, make an example of a tabular data set with five rows and three columns. This can capture any type of information you would like. (If you don’t have any ideas, try your “to do” list.) We will share these together in class.
Give each column of your data set a name. Follow the variable name rules described above.
Once you have finished reading and completing the items above, make sure you have responded to the survey.